3.2.4 Exercises

  1. Run ggplot(data = mpg). What do you see?
ggplot(data = mpg)

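The plot is empty: ggplot(data = mpg) only creates a blank canvas. No aesthetic mappings have been given and no geom layers have been added, so there is nothing to draw.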

  1. How many rows are in mpg? How many columns?
mpg
## # A tibble: 234 x 11
##    manufac… model   displ  year   cyl trans  drv     cty   hwy fl    class
##    <chr>    <chr>   <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
##  1 audi     a4       1.80  1999     4 auto(… f        18    29 p     comp…
##  2 audi     a4       1.80  1999     4 manua… f        21    29 p     comp…
##  3 audi     a4       2.00  2008     4 manua… f        20    31 p     comp…
##  4 audi     a4       2.00  2008     4 auto(… f        21    30 p     comp…
##  5 audi     a4       2.80  1999     6 auto(… f        16    26 p     comp…
##  6 audi     a4       2.80  1999     6 manua… f        18    26 p     comp…
##  7 audi     a4       3.10  2008     6 auto(… f        18    27 p     comp…
##  8 audi     a4 qua…  1.80  1999     4 manua… 4        18    26 p     comp…
##  9 audi     a4 qua…  1.80  1999     4 auto(… 4        16    25 p     comp…
## 10 audi     a4 qua…  2.00  2008     4 manua… 4        20    28 p     comp…
## # ... with 224 more rows

234 rows and 11 columns in total.
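The same counts can be read off programmatically instead of from the printed header. A minimal sketch, assuming ggplot2 is installed (it provides mpg); the variable names n_rows and n_cols are only illustrative:

```r
library(ggplot2)  # provides the mpg dataset

n_rows <- nrow(mpg)  # number of observations
n_cols <- ncol(mpg)  # number of variables
dim(mpg)             # both at once: rows first, then columns
```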

  1. What does the drv variable describe? Read the help for ?mpg to find out.

drv describes the car's drive train: f for front-wheel drive, r for rear-wheel drive, and 4 for four-wheel drive.

  1. Make a scatterplot of hwy vs cyl.
ggplot(mpg, aes(x = cyl, y = hwy)) +
  geom_point()

  1. What happens if you make a scatterplot of class vs drv? Why is the plot not useful?
ggplot(mpg, aes(x = class, y = drv)) +
  geom_point()

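Scatterplots are suitable for displaying continuous variables (e.g., cty and hwy). class and drv are both discrete, so many observations are drawn on top of each other at the same few positions. The plot only shows which combinations of class and drv occur, not how many cars fall into each.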

3.3.1 Exercises

  1. What’s gone wrong with this code? Why are the points not blue?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

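The points are not blue because color = "blue" sits inside aes(), where it is treated as a mapping: ggplot2 creates a constant variable with the value "blue" and assigns it a colour from the default scale. To make the points blue, the colour must be set manually, i.e., placed outside aes():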

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

  1. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?
mpg
## # A tibble: 234 x 11
##    manufac… model   displ  year   cyl trans  drv     cty   hwy fl    class
##    <chr>    <chr>   <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr>
##  1 audi     a4       1.80  1999     4 auto(… f        18    29 p     comp…
##  2 audi     a4       1.80  1999     4 manua… f        21    29 p     comp…
##  3 audi     a4       2.00  2008     4 manua… f        20    31 p     comp…
##  4 audi     a4       2.00  2008     4 auto(… f        21    30 p     comp…
##  5 audi     a4       2.80  1999     6 auto(… f        16    26 p     comp…
##  6 audi     a4       2.80  1999     6 manua… f        18    26 p     comp…
##  7 audi     a4       3.10  2008     6 auto(… f        18    27 p     comp…
##  8 audi     a4 qua…  1.80  1999     4 manua… 4        18    26 p     comp…
##  9 audi     a4 qua…  1.80  1999     4 auto(… 4        16    25 p     comp…
## 10 audi     a4 qua…  2.00  2008     4 manua… 4        20    28 p     comp…
## # ... with 224 more rows

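Categorical: manufacturer, model, trans, drv, fl, class
Continuous: displ, year, cty, hwy

cyl is stored as an integer but takes only four distinct values (4, 5, 6, and 8), so it is often treated as categorical.

This information is shown in the row below the column names in the printed tibble: <chr> marks a character vector (categorical), while <int> and <dbl> mark integer and double vectors (numeric).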
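The type of every column can also be listed directly rather than read from the printed tibble. A minimal sketch, assuming ggplot2 is installed; col_types is an illustrative name:

```r
library(ggplot2)  # provides the mpg dataset

# First class of each column, returned as a named character vector
col_types <- vapply(mpg, function(col) class(col)[1], character(1))
col_types

# Names of the character (categorical) columns
names(col_types)[col_types == "character"]
```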

  1. What happens if you map the same variable to multiple aesthetics?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, colour = displ))

The code runs, but the colour is redundant: displ is already shown on the x axis, so the colour gradient simply runs along that axis and adds no new information.

  1. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)

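stroke controls the width of a shape's border. It has a visible effect on shapes that are drawn with an outline, most notably the filled shapes 21 to 25, which have a border colour separate from their fill.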

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), stroke = 2, shape = 23)

  1. What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, colour = displ < 5))

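In this example, displ < 5 is evaluated for every row and produces a logical variable. ggplot2 treats it as a discrete variable with two levels: observations with displ < 5 are grouped together (TRUE), and all remaining observations form a second group (FALSE).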
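To see what ggplot2 receives, the expression can be evaluated on its own. A minimal sketch, assuming ggplot2 is installed; small_engine and n_colours are illustrative names:

```r
library(ggplot2)

# The expression yields one TRUE/FALSE value per row of mpg
small_engine <- mpg$displ < 5
table(small_engine)

# The built plot therefore uses exactly two colours, one per logical value
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = displ < 5))
n_colours <- length(unique(layer_data(p)$colour))
```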

3.5.1 Exercises

  1. What happens if you facet on a continuous variable?

Each distinct value of the continuous variable is treated as a separate category, which usually produces a large number of panels. For example:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ cty, nrow = 2)

  1. What do the empty cells in the plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = drv, y = cyl))

The empty cells are combinations of drv and cyl with no observations: there are no four-wheel-drive or rear-wheel-drive cars with 5 cylinders, and no rear-wheel-drive cars with 4 cylinders. No cars have 7 cylinders.

The plot above shows the same information as the plot with facet_grid(drv ~ cyl), but in a single panel rather than a grid of facets: each missing point corresponds to an empty cell.
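The empty combinations can also be confirmed with a cross-tabulation. A minimal sketch, assuming ggplot2 is installed; drv_by_cyl and empty_cells are illustrative names:

```r
library(ggplot2)  # provides the mpg dataset

# Count cars for every combination of drive train and cylinder count
drv_by_cyl <- table(drv = mpg$drv, cyl = mpg$cyl)
drv_by_cyl

# Row and column positions of the combinations with no cars
empty_cells <- which(drv_by_cyl == 0, arr.ind = TRUE)
empty_cells
```

Each zero in the table corresponds to an empty panel in facet_grid(drv ~ cyl).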

  1. What plots does the following code make? What does . do?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

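. is a placeholder for a dimension you do not want to facet on. facet_grid(drv ~ .) facets by drv in rows with no column faceting, and facet_grid(. ~ cyl) facets by cyl in columns with no row faceting.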

  1. Take the first faceted plot in this section:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

Faceting makes it easier to see the pattern within each group, and it avoids having to distinguish many similar colours. However, it is more difficult to detect the overall pattern and to compare groups directly, because the points are spread across separate panels. As the dataset becomes larger, points of different colours increasingly overlap, so faceting becomes preferable to the colour aesthetic for comparing groups.

  1. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?

nrow and ncol let you specify the number of rows and columns in which the panels are laid out. Other options that control the layout of the panels include dir and as.table. facet_grid() does not have nrow and ncol arguments because the numbers of rows and columns are determined by the number of unique levels of the two faceting variables.
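The layout arguments can be tried out on the class facets. A minimal sketch, assuming ggplot2 is installed; p_wrap and n_panels are illustrative names, and the values ncol = 3 and dir = "v" are arbitrary choices:

```r
library(ggplot2)

# Lay the panels out in 3 columns, filling them vertically (column by column)
p_wrap <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, ncol = 3, dir = "v")

# One panel per distinct value of class
n_panels <- length(unique(layer_data(p_wrap)$PANEL))
p_wrap
```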

  1. When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?

Most screens and pages are wider than they are tall, so putting the variable with more unique levels in the columns leaves more room for each panel and produces a more readable plot.

3.6.1 Exercises

  1. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?

geom_line(), geom_boxplot(), geom_histogram(), and geom_area(), respectively.

  1. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'

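I did not predict three separate lines. However, the mappings, including color = drv, are specified globally in ggplot(), so they are passed to both geom_point() and geom_smooth(), and a separate smooth line is fitted for each value of drv.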

  1. What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter?
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class), show.legend = FALSE) + 
  geom_smooth()
## `geom_smooth()` using method = 'loess'

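show.legend = FALSE removes the legend for class from the plot. If you remove the argument, the legend appears, because show.legend is NA by default for geom_point() (see ?geom_point), and NA means a layer is included in the legend whenever it maps an aesthetic. It was presumably used earlier in the chapter to keep plots shown side by side the same size, since a legend would take space from only one of them.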

  1. What does the se argument to geom_smooth() do?

se controls whether the confidence interval around the smooth line is displayed (it is TRUE by default).

level allows you to specify the level of the confidence interval to use (0.95 by default).
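The effect of level can be checked by comparing the width of the band at two levels. A minimal sketch, assuming ggplot2 is installed; band_width() is a hypothetical helper, and method = "lm" with formula = y ~ x is chosen here only to keep the example quick and free of messages (the plots above use the default smoother):

```r
library(ggplot2)

# Hypothetical helper: average width of the confidence band at a given level
band_width <- function(level) {
  p <- ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_smooth(method = "lm", formula = y ~ x, se = TRUE, level = level)
  smooth_data <- layer_data(p, 1)
  mean(smooth_data$ymax - smooth_data$ymin)
}

width_95 <- band_width(0.95)
width_99 <- band_width(0.99)  # a higher level gives a wider band
```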

  1. Will these two graphs look different? Why/why not?
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()
## `geom_smooth()` using method = 'loess'

ggplot() + 
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess'

No, both will look the same because they use the same mappings. In the first example, the mappings are specified at the global level. In the second example, the mappings are specified at the local level.

  1. Recreate the R code necessary to generate the following graphs.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(size = 3) +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(size = 3) +
  geom_smooth(aes(group = drv), se = FALSE)
## `geom_smooth()` using method = 'loess'

ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point(size = 3) +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = drv), size = 3) +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = drv), size = 3) +
  geom_smooth(aes(linetype = drv), se = FALSE)
## `geom_smooth()` using method = 'loess'

ggplot(mpg, aes(x = displ, y = hwy, fill = drv)) +
  geom_point(shape = 21, colour = "white", size = 3, stroke = 3)

3.7.1 Exercises

  1. What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?

The default geom associated with stat_summary() is geom_pointrange(). By default, geom_pointrange() uses stat_identity and does not compute ymin or ymax itself, so set stat = "summary" and pass the summary functions through. In ggplot2 3.3.0 and later these arguments are named fun.min, fun.max, and fun; the older fun.ymin, fun.ymax, and fun.y are deprecated:

ggplot(data = diamonds) +
  geom_pointrange(mapping = aes(x = cut, y = depth),
                  stat = "summary",
                  fun.min = min,
                  fun.max = max,
                  fun = median)

  1. What does geom_col() do? How is it different to geom_bar()?

geom_col() is used when the heights of the bars represent values already present in the data. It uses stat_identity by default. In contrast, geom_bar() uses stat_count, which counts the number of observations at each value of x before plotting.
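The relationship between the two geoms can be illustrated by counting the data by hand and passing the result to geom_col(). A minimal sketch, assuming ggplot2 is installed; cut_counts, p_bar, and p_col are illustrative names:

```r
library(ggplot2)

# geom_bar() counts the rows for each cut itself (stat_count)
p_bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# geom_col() expects the heights to be in the data already (stat_identity),
# so the counts are computed first; table() names the count column Freq
cut_counts <- as.data.frame(table(cut = diamonds$cut))
p_col <- ggplot(data = cut_counts) +
  geom_col(mapping = aes(x = cut, y = Freq))

# Both plots draw bars of the same height
bar_heights <- layer_data(p_bar)$y
col_heights <- layer_data(p_col)$y
```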

  1. Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?

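Some examples: geom_bar() and stat_count(), geom_boxplot() and stat_boxplot(), geom_histogram() and stat_bin(), geom_density() and stat_density(), geom_smooth() and stat_smooth(). Paired geoms and stats often share the same name suffix, and each is the other's default.

stat_identity appears to be the most common default stat. However, geoms involving only one variable or a discrete variable tend to have a different default stat (e.g., geom_histogram() uses stat_bin).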

  1. What variables does stat_smooth() compute? What parameters control its behaviour?

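stat_smooth() computes y (the predicted value), ymin and ymax (the lower and upper bounds of the confidence interval), and se (the standard error). Its behaviour is mainly controlled by method, which determines the fitting function (e.g., lm or loess), along with formula, se, level, span, and n.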
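The computed variables can be inspected in the data that ggplot2 builds for the layer. A minimal sketch, assuming ggplot2 is installed; smooth_data is an illustrative name, and method = "lm" is an arbitrary choice for the example:

```r
library(ggplot2)

p_smooth <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  stat_smooth(method = "lm", formula = y ~ x)

# layer_data() returns the data frame produced by the stat for this layer
smooth_data <- layer_data(p_smooth, 1)
head(smooth_data[, c("x", "y", "ymin", "ymax", "se")])
```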

  1. In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop..))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))

The problem with the above plots is that the proportions are calculated within each group, and by default each value of cut is its own group, so every bar shows a proportion of 1.00. Setting group = 1 puts all observations into a single group, so the proportions are calculated relative to the whole dataset.
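The difference can be checked by comparing the computed proportions with and without group = 1. A minimal sketch, assuming ggplot2 3.3.0 or later, where after_stat(prop) is the current spelling of ..prop..; the object names are illustrative:

```r
library(ggplot2)

# Default grouping: each cut is its own group, so every proportion is 1
p_default <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

# group = 1: one group for all rows, so proportions are shares of the total
p_grouped <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

prop_default <- layer_data(p_default)$prop
prop_grouped <- layer_data(p_grouped)$prop
```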

3.8.1 Exercises

  1. What is the problem with this plot? How could you improve it?
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_point()

Many points overlap because cty and hwy are rounded to whole numbers (this is known as overplotting). To improve the plot, you could add a small amount of random noise to each point with position = "jitter" or its shortcut, geom_jitter().

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_point(position = "jitter")

  1. What parameters to geom_jitter() control the amount of jittering?

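width and height, which set the amount of horizontal and vertical jitter. The noise is added in both directions, so the total spread is twice the value given. If omitted, each defaults to 40% of the resolution of the data.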
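Both arguments can be set independently, for example to jitter only horizontally. A minimal sketch, assuming ggplot2 is installed; the values 0.3 and 0 are arbitrary, and the object names are illustrative:

```r
library(ggplot2)

# Jitter by up to 0.3 units left or right, and not at all vertically
p_jitter <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.3, height = 0)

jittered <- layer_data(p_jitter)
max_x_shift <- max(abs(jittered$x - mpg$cty))  # stays within the chosen width
p_jitter
```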

  1. Compare and contrast geom_jitter() with geom_count().

Here is an example of geom_count():

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_point() +
  geom_count()

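geom_count() is similar to geom_point(), but it counts the observations at each location and maps the count to point area. In contrast, geom_jitter() adds random noise to each point to prevent overplotting. Both address overplotting: geom_count() keeps the exact positions, while geom_jitter() changes them slightly.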

  1. What’s the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it.

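position = "dodge2" in current versions of ggplot2 (older releases used "dodge"). Boxes that share an x value are placed side by side rather than on top of each other: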

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot(mapping = aes(colour = drv))

3.9.1 Exercises

  1. Turn a stacked bar chart into a pie chart using coord_polar().
ggplot(diamonds, aes(x = factor(1), fill = factor(cut))) + 
  geom_bar(width = 1) +
  coord_polar(theta = "y")

As noted in the documentation for coord_polar(), these plots should be used with caution because polar coordinates have major perceptual problems.

  1. What does labs() do? Read the documentation.

labs() allows you to modify axis, legend, and plot labels.
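A short illustration of the labels that labs() can set. A minimal sketch, assuming ggplot2 is installed; the label text itself is made up for the example:

```r
library(ggplot2)

p_labelled <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  labs(
    title = "Fuel efficiency by engine size",  # plot title
    x = "Engine displacement (litres)",        # x-axis label
    y = "Highway miles per gallon",            # y-axis label
    colour = "Drive train"                     # legend title
  )
p_labelled
```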

  1. What’s the difference between coord_quickmap() and coord_map()?

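coord_map() projects a portion of the earth, which is approximately spherical, onto a flat 2D plane. Map projections do not, in general, preserve straight lines, so this requires considerable computation. coord_quickmap() is a quick approximation that does preserve straight lines and requires less computation. It works best for smaller areas closer to the equator.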

  1. What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed()

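There is a positive relationship between city and highway mpg.

coord_fixed() fixes the ratio between the units on the x and y axes (1 by default), so one mpg covers the same distance on both axes and the reference line is drawn at 45 degrees.

geom_abline() adds a reference line. With its default arguments the line has intercept 0 and slope 1, i.e., hwy = cty. Because the points lie above this line, the plot shows that cars get more miles per gallon on the highway than they do in the city.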
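Both observations can be backed up numerically. A minimal sketch, assuming ggplot2 is installed; share_above_line and cty_hwy_cor are illustrative names:

```r
library(ggplot2)  # provides the mpg dataset

# Share of cars whose highway mpg exceeds their city mpg
# (i.e., points above the hwy = cty reference line)
share_above_line <- mean(mpg$hwy > mpg$cty)

# Strength of the linear relationship between the two variables
cty_hwy_cor <- cor(mpg$cty, mpg$hwy)

share_above_line
cty_hwy_cor
```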